{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# External vectorization\n",
        "\n",
        "[txtai](https://github.com/neuml/txtai) is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.\n",
        "\n",
        "Vectorization is the process of transforming data into numbers using machine learning models. Input data is run through a model and fixed dimension vectors are returned. These vectors can then be loaded into a vector database for similarity search.\n",
        "\n",
        "txtai is an open-source first system. Given it's own open-source roots, like-minded projects such as [sentence-transformers](https://github.com/UKPLab/sentence-transformers) are prioritized during development. But that doesn't mean txtai can't work with Embeddings API services.\n",
        "\n",
        "This notebook will show to use txtai with external vectorization."
      ],
      "metadata": {
        "id": "SwgRD_NGutB9"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ],
      "metadata": {
        "id": "68Iw-nPbhYIJ"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "R0AqRP7v1hdr"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create an Embeddings dataset\n",
        "\n",
        "The first thing we'll do is pre-compute an embeddings dataset. In addition to Embeddings APIs, this can also be used during internal testing to tune index and database settings."
      ],
      "metadata": {
        "id": "_F84mqRHwpCP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai import Embeddings\n",
        "\n",
        "# Load dataset\n",
        "wikipedia = Embeddings()\n",
        "wikipedia.load(provider=\"huggingface-hub\", container=\"neuml/txtai-wikipedia\")\n",
        "\n",
        "# Query for Top 10,000 most popular articles\n",
        "query = \"\"\"\n",
        "SELECT id, text FROM txtai\n",
        "order by percentile desc\n",
        "LIMIT 10000\n",
        "\"\"\"\n",
        "\n",
        "data = wikipedia.search(query)\n",
        "\n",
        "# Encode vectors using same vector model as Wikipedia\n",
        "vectors = wikipedia.batchtransform(x[\"text\"] for x in data)\n",
        "\n",
        "# Build dataset of id, text, embeddings\n",
        "dataset = []\n",
        "for i, row in enumerate(data):\n",
        "  dataset.append({\"id\": row[\"id\"], \"article\": row[\"text\"], \"embeddings\": vectors[i]})\n"
      ],
      "metadata": {
        "id": "U52gN-vUxjky"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Build an Embeddings index with external vectors\n",
        "\n",
        "Next, we'll create an Embedding index with an external transform function set.\n",
        "\n",
        "The external transform function can be any function or callable object. This function takes an array of data and returns an array of embeddings."
      ],
      "metadata": {
        "id": "Ggo-N9iOQJ4d"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def transform(inputs):\n",
        "  return wikipedia.batchtransform(inputs)\n",
        "\n",
        "def stream():\n",
        "  for row in dataset:\n",
        "    # Index vector\n",
        "    yield row[\"id\"], row[\"embeddings\"]\n",
        "\n",
        "    # Index metadata\n",
        "    yield {\"id\": row[\"id\"], \"article\": row[\"article\"]}\n",
        "\n",
        "embeddings = Embeddings(transform=\"__main__.transform\", content=True)\n",
        "embeddings.index(stream())"
      ],
      "metadata": {
        "id": "iujXcHzMCd0B"
      },
      "execution_count": 64,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "🚀 Notice how fast creating the index was compared to indexing. This is because there is no vectorization! Now let's run a query."
      ],
      "metadata": {
        "id": "B0Se6nG7JxOG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"select id, article, score from txtai where similar(:x)\", parameters={\"x\": \"operating system\"})"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Bi0WnvosEL8E",
        "outputId": "d190e8c1-cb08-4d40-989c-aed4a67edc06"
      },
      "execution_count": 63,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': 'Operating system',\n",
              "  'article': 'An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs.',\n",
              "  'score': 0.8955847024917603},\n",
              " {'id': 'MacOS',\n",
              "  'article': \"macOS (;), originally Mac\\xa0OS\\xa0X, previously shortened as OS\\xa0X, is an operating system developed and marketed by Apple Inc. since 2001. It is the primary operating system for Apple's Mac computers. Within the market of desktop and laptop computers, it is the second most widely used desktop OS, after Microsoft Windows and ahead of all Linux distributions, including ChromeOS.\",\n",
              "  'score': 0.8666583299636841},\n",
              " {'id': 'Linux',\n",
              "  'article': 'Linux is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution (distro), which includes the kernel and supporting system software and libraries, many of which are provided by the GNU Project. Many Linux distributions use the word \"Linux\" in their name, but the Free Software Foundation uses and recommends the name \"GNU/Linux\" to emphasize the use and importance of GNU software in many distributions, causing some controversy.',\n",
              "  'score': 0.839817225933075}]"
            ]
          },
          "metadata": {},
          "execution_count": 63
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "All as expected! This method can also be used with existing datasets on the Hugging Face Hub."
      ],
      "metadata": {
        "id": "jUxUBP0vKUZo"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Integrate with Embeddings API services\n",
        "\n",
        "Next, we'll integrate with an Embeddings API service to build vectors.\n",
        "\n",
        "The code below interfaces with the Hugging Face Inference API. This can easily be switched to OpenAI, Cohere or even your own local API."
      ],
      "metadata": {
        "id": "lUU8WlIbKnOi"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import requests\n",
        "\n",
        "BASE = \"https://api-inference.huggingface.co/pipeline/feature-extraction\"\n",
        "\n",
        "def transform(inputs):\n",
        "  # Your API provider of choice\n",
        "  response = requests.post(f\"{BASE}/sentence-transformers/nli-mpnet-base-v2\", json={\"inputs\": inputs})\n",
        "  return np.array(response.json(), dtype=np.float32)\n",
        "\n",
        "data = [\n",
        "  \"US tops 5 million confirmed virus cases\",\n",
        "  \"Canada's last fully intact ice shelf has suddenly collapsed, \" +\n",
        "  \"forming a Manhattan-sized iceberg\",\n",
        "  \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
        "  \"The National Park Service warns against sacrificing slower friends \" +\n",
        "  \"in a bear attack\",\n",
        "  \"Maine man wins $1M from $25 lottery ticket\",\n",
        "  \"Make huge profits without work, earn up to $100,000 a day\"\n",
        "]\n",
        "\n",
        "embeddings = Embeddings({\"transform\": transform, \"backend\": \"numpy\", \"content\": True})\n",
        "embeddings.index(data)\n",
        "embeddings.search(\"feel good story\", 1)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9-V6OIEtLEIs",
        "outputId": "e2492e33-27f7-48f3-e409-5c63e50c7f35"
      },
      "execution_count": 74,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '4',\n",
              "  'text': 'Maine man wins $1M from $25 lottery ticket',\n",
              "  'score': 0.08329013735055923}]"
            ]
          },
          "metadata": {},
          "execution_count": 74
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "This is the classic txtai tutorial example. Except this time, vectorization is run with an external API service!"
      ],
      "metadata": {
        "id": "uW8pDvD5Ms2r"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook showed how txtai can integrate with external vectorization. This can be a dataset with pre-computed embeddings and/or an Embeddings API service.\n",
        "\n",
        "Each of txtai's components can be fully customized and vectorization is no exception. Flexibility and customization for the win!"
      ],
      "metadata": {
        "id": "8wGf3O_2YGbd"
      }
    }
  ]
}